Map maker, map maker, make me a map!

Today we’re going to look at how we can use joins and geospatial data more explicitly to make maps. Maps are among the earliest data visualizations ever made, and some of the most powerful. They’re also one of the places where joins become incredibly important, as putting data on a map means joining our data with a geospatial description of the map we want.

Death from Heart Disease

The data set we’ll be using today is heart disease mortality data from the CDC.

#load the data and prep
#for some data manipulation
library(readxl)
library(dplyr)

heart_disease <- read_excel("./data/hd_all.xlsx", 
                            na="Insufficient Data")

head(heart_disease)
## # A tibble: 6 x 4
##   State   County  Death_Rate FIPS_Code
##   <chr>   <chr>        <dbl>     <dbl>
## 1 Alabama Autauga        463      1001
## 2 Alabama Baldwin        391      1003
## 3 Alabama Barbour        533      1005
## 4 Alabama Bibb           511      1007
## 5 Alabama Blount         426      1009
## 6 Alabama Bullock        483      1011

OK, we see that we have state, county, and a death rate for each. FIPS codes, FYI, are standardized county codes. We’ll be ignoring them.
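Although we’ll ignore FIPS codes here, they’re worth a second look if you ever need a more robust join key: a FIPS code is five digits - two for the state, three for the county - and because the codes were read in as numbers, any leading zero has been dropped. A quick sketch of restoring it (the example codes here are just illustrative):

```r
# Alabama's state code is 01, so Autauga County is really "01001";
# reading the codes as numbers dropped the leading zero.
# formatC pads numbers back out to a fixed width with zeros.
fips <- c(1001, 1003, 6037)
formatC(fips, width = 5, flag = "0")
# "01001" "01003" "06037"
```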

Introducing: Maps

There are a LOT of ways to get map data into R. We’re going to begin with the simplest - grabbing it from an R package. ggplot2 works in tandem with the maps package to provide a few standardized sets of maps for easy plotting. Let’s take a look at one: the counties of the U.S. lower 48.

#install the maps package if you
#don't have it
library(ggplot2)

#map_data gets us one of the select maps
map_df <- map_data("county")

head(map_df)
##        long      lat group order  region subregion
## 1 -86.50517 32.34920     1     1 alabama   autauga
## 2 -86.53382 32.35493     1     2 alabama   autauga
## 3 -86.54527 32.36639     1     3 alabama   autauga
## 4 -86.55673 32.37785     1     4 alabama   autauga
## 5 -86.57966 32.38357     1     5 alabama   autauga
## 6 -86.59111 32.37785     1     6 alabama   autauga

OK, we have latitude and longitude of county borders, a group (each county polygon has one group), and both a region and a subregion - here, the state and the county. Note the columns aren’t actually called state and county - the maps framework is a bit broader than that, as other maps include cities and US territories. Also note that capitalization is wonky - it’s all lower case.

To show you how we would use this data:

ggplot(data = map_df, mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon()

geom_polygon is our newest friend, and it creates polygons from the lat/long paths above. There are other ways to use this map, but this is the clearest.
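To see why the group aesthetic matters, here’s a minimal sketch with two hand-made triangles; without group, geom_polygon would string all six vertices into one tangled shape.

```r
library(ggplot2)

# two triangles; `group` tells geom_polygon where one shape ends
# and the next begins, just as it separates counties above
tri <- data.frame(
  long  = c(0, 1, 0.5,  2, 3, 2.5),
  lat   = c(0, 0, 1,    0, 0, 1),
  group = rep(c(1, 2), each = 3)
)

ggplot(tri, aes(x = long, y = lat, group = group)) +
  geom_polygon()
```

Drop `group = group` from the aes() call and you’ll see the two triangles fuse into a single messy polygon - the same thing that happens to county maps without it.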

Getting ready to join maps and data

OK, so, we want to see how death from heart disease varies by county! To do this, we’ll need to join the two data frames. Note, the number of rows in map_df is YUGE compared to those in our data set, as they define county borders. So we’re going to be merging one data point to many rows in the map_df data frame. But, we need to do some prep work first.
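To see that one-to-many behavior in miniature, here’s a toy sketch (with made-up county names and values): one row of data fans out across every border point for that county.

```r
library(dplyr)

# one death rate per (toy) county...
rates   <- data.frame(County = c("a", "b"), Death_Rate = c(400, 500))

# ...but several border points per county
borders <- data.frame(County = c("a", "a", "a", "b", "b"),
                      long   = 1:5, lat = 6:10)

joined <- left_join(borders, rates, by = "County")
nrow(joined)  # 5 - each border point now carries its county's rate
```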

First, we need to rename the columns in our map data frame to match those in heart_disease, and we need to make the State and County values in heart_disease lower case to match the map.

map_df <- map_df %>%
  rename(State = region,
         County = subregion)

heart_disease <- heart_disease %>%
  mutate(State = tolower(State), 
         County = tolower(County))

Now let’s test out the join! Before we do the big join, let’s look at the mismatch. This is the perfect job for anti_join.

bad_match <- anti_join(heart_disease, map_df)
## Joining, by = c("State", "County")
nrow(bad_match)
## [1] 112
head(bad_match)
## # A tibble: 6 x 4
##   State   County         Death_Rate FIPS_Code
##   <chr>   <chr>               <dbl>     <dbl>
## 1 alabama dekalb                524      1049
## 2 alabama st. clair             422      1115
## 3 alaska  aleutians east        106      2013
## 4 alaska  aleutians west        217      2016
## 5 alaska  anchorage             253      2020
## 6 alaska  bethel                359      2050

Uh oh! 112 rows of bad matches! Why? Well, we can see one of the reasons quickly - the US Virgin Islands. They are not in the map data set. Second, though, we can see st. clair - indeed, the map data frame has no . or ’ characters in it, so we should filter those out of heart_disease and then re-check.

heart_disease <- heart_disease %>%
  mutate(County = gsub("\\.", "", County)) %>%
  mutate(County = gsub("\\'", "", County))

bad_match2 <- anti_join(heart_disease, map_df)
## Joining, by = c("State", "County")
nrow(bad_match2)
## [1] 82
bad_match2
## # A tibble: 82 x 4
##    State   County               Death_Rate FIPS_Code
##    <chr>   <chr>                     <dbl>     <dbl>
##  1 alabama dekalb                      524      1049
##  2 alaska  aleutians east              106      2013
##  3 alaska  aleutians west              217      2016
##  4 alaska  anchorage                   253      2020
##  5 alaska  bethel                      359      2050
##  6 alaska  bristol bay                  NA      2060
##  7 alaska  denali                      272      2068
##  8 alaska  dillingham                  382      2070
##  9 alaska  fairbanks north star        269      2090
## 10 alaska  haines                      265      2100
## # ... with 72 more rows

OK, much better - down to 82. And looking at what’s left, those are cities, not counties, so they should not affect our map making.

Are we good the other way?

bad_match3 <- anti_join(map_df, heart_disease)
## Joining, by = c("State", "County")
#because we get a data frame return
length(unique(bad_match3$County))
## [1] 8
unique(bad_match3$County)
## [1] "de kalb"              "washington"           "de soto"             
## [4] "du page"              "la porte"             "yellowstone national"
## [7] "la moure"             "de witt"

Only 8. Were we trying to make this perfect, we’d try to figure out why, but for now, let’s soldier on. 8 is acceptable.

Joining maps and data

So we know that there are some territories and cities left in the heart disease data, and we don’t want to worry about them for the moment. Given that we’ve got some mismatch on both sides, we want to use an inner_join - although we could also use a full_join (an outer join) so that the entire map is retained, and thus see which counties we are missing data for. Try it both ways.
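The difference between the two joins is easy to see on toy data (made-up keys and values): inner_join drops the non-matches on both sides, while full_join keeps them and fills in NAs.

```r
library(dplyr)

# "x" appears only in the data; "y" only in the map
disease <- data.frame(County = c("a", "b", "x"), Death_Rate = c(1, 2, 3))
borders <- data.frame(County = c("a", "b", "y"), long = c(10, 20, 30))

nrow(inner_join(disease, borders, by = "County"))  # 2: only shared counties
nrow(full_join(disease, borders, by = "County"))   # 4: "x" and "y" kept, with NAs
```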

heart_disease_map_data <- inner_join(heart_disease, map_df)
## Joining, by = c("State", "County")

And now - let’s plot it!

heart_map <- ggplot(data=heart_disease_map_data, 
       mapping = aes(x = long, y = lat, group = group,
                     fill = Death_Rate)) +
  geom_polygon() 

heart_map

Making pretty choropleths

So, this is a choropleth map. There are some issues - the scale is hard to resolve, and everything is hard to see. There are a lot of ways we could rectify it. Here are a few tricks.

First, a better scale. Maybe a gradient?

heart_map +
  scale_fill_gradient(low = "white", high = "red")

OK, not bad! Not great, but much better. Are there gradients you’d prefer?

However, the bigger problem is that we have a huge range to cover. One common way to make choropleths is to bin data into categories and go from there. Binning in combination with a nice color scale - say, via Color Brewer - can be great. Remember Color Brewer? http://colorbrewer2.org/ actually uses maps as its demo! And there’s a scale_fill_brewer function in ggplot2.
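A quick sketch of what cut_interval does before we apply it to the map - it’s a ggplot2 helper that slices a numeric vector into n bins of equal width (the values here are made up):

```r
library(ggplot2)

# cut_interval() makes n groups, each spanning an equal range of the data
rates <- c(100, 180, 260, 340, 420, 500)
bins  <- cut_interval(rates, n = 5)
nlevels(bins)  # 5 equal-width bins covering 100-500
```

Its sibling cut_number() instead makes bins with roughly equal numbers of observations, which can work better for skewed data.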

So let’s make bins and plot that instead!

heart_map_binned <- ggplot(data=heart_disease_map_data, 
       mapping = aes(x = long, y = lat, group = group,
                     fill = cut_interval(Death_Rate, 5))) +
  geom_polygon() 

heart_map_binned +
    scale_fill_brewer(palette="Reds")

MUCH nicer. Now we can really start to see some hot spots.

Faded Examples

For our faded examples, today we’re going to look at data on TB incidence from the World Health Organization. This data set covers TB for the entire planet at the country level. Let’s take a look.

library(readr)

tb_world <- read_csv("./data/who_tb_data.csv")

tb_world
## # A tibble: 3,651 x 18
##    country iso2  iso3  g_whoregion  year estimated_popul… est_incidences_…
##    <chr>   <chr> <chr> <chr>       <int>            <int>            <dbl>
##  1 Afghan… AF    AFG   EMR          2000         20093756              190
##  2 Afghan… AF    AFG   EMR          2001         20966463              189
##  3 Afghan… AF    AFG   EMR          2002         21979923              189
##  4 Afghan… AF    AFG   EMR          2003         23064851              189
##  5 Afghan… AF    AFG   EMR          2004         24118979              189
##  6 Afghan… AF    AFG   EMR          2005         25070798              189
##  7 Afghan… AF    AFG   EMR          2006         25893450              189
##  8 Afghan… AF    AFG   EMR          2007         26616792              189
##  9 Afghan… AF    AFG   EMR          2008         27294031              189
## 10 Afghan… AF    AFG   EMR          2009         28004331              189
## # ... with 3,641 more rows, and 11 more variables:
## #   est_incidences_per_100K_people_lwr <dbl>,
## #   est_incidences_per_100K_people_upr <dbl>, est_incidences <int>,
## #   est_incidences_lwr <int>, est_incidences_upr <int>,
## #   est_mortality_per_100K_people <dbl>,
## #   est_mortality_per_100K_people_lwr <dbl>,
## #   est_mortality_per_100K_people_upr <dbl>, est_mortality <int>,
## #   est_mortality_lwr <int>, est_mortality_upr <int>

There’s a lot going on here - in particular, let’s focus on estimated incidences, incidences per 100K people, and mortality. upr and lwr denote upper and lower bounds of the estimates.

So, a simple map of 2015.

#get a map
worldmap <- map_data("world")

#filter to 2015
tb_2015 <- tb_world %>% filter(year == 2015)

#join
incidence_df <- left_join(worldmap, tb_2015, by = c("region" = "country"))

#plot
ggplot(incidence_df, aes(x = long, y = lat, 
                         group = group, fill=est_incidences)) +
  geom_polygon() +
  scale_fill_gradient(low = "blue", high = "red")

Huh - not everything joins well, but we’ll ignore that for now.

Let’s look at mortality in 2000

#get a map
worldmap <- map_data("____")


#filter to 2015
tb_2000 <- tb_world %>% filter(year == ____)

#join
incidence_df_2000 <- left_join(worldmap, ____, by = c("region" = "country"))

#plot
ggplot(incidence_df_2000, aes(x = ____, y = ____, 
                         group = group, fill=est_mortality)) +
  geom_polygon() +
  scale_fill_gradient(____)

Now let’s look at incidence per 100K, binned by interval, in 2016 - feel free to do something different with the fill.

#get a map
worldmap <- ____("____")


#filter to 2016
tb_2016 <- tb_world %>% ____(year == ____)

#join
incidence_df_2016 <- left_join(worldmap, ____, by = c("region" = "country"))

#plot
ggplot(incidence_df_2016, aes(x = ____, y = ____, 
                         group = ____, 
                         fill=____(est_incidences_per_100K_people,5))) +
  ____() +
  scale_fill_brewer(palette = "YlOrBr",
                    guide = guide_legend("Incidences per 100K"))

Exercises

  1. What does not join well? Can you fix any of them via changes in the tb or worldmap dataset?

  2. Can you make a map showing change in mortality over four years (filter to four years, and then use facet_wrap)?

Is all map data available via packages?

Most map data comes in files separate from R. The files that define borders are most commonly shapefiles. Shapefiles are funky things. Let’s take a look at the shapefile for a common classification of marine ecoregions of the world - coastal regions binned by biogeographic boundaries. To load them, we’ll need the sp and rgdal libraries. These might take some doing to install properly, but it’s worth it.

library(sp)
library(rgdal)
ecoregions <- readOGR("./data/MEOW-TNC/meow_ecos.shp", layer="meow_ecos")
## OGR data source with driver: ESRI Shapefile 
## Source: "./data/MEOW-TNC/meow_ecos.shp", layer: "meow_ecos"
## with 232 features
## It has 9 fields
plot(ecoregions)

Neat - these kinds of files can be plotted on their own! But, we’re going to take this a bit further.

Looking under the hood of a shapefile

So, if we dig into this object, what do we see?

class(ecoregions)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
#It's an S4 object
slotNames(ecoregions)
## [1] "data"        "polygons"    "plotOrder"   "bbox"        "proj4string"

So, it’s a SpatialPolygonsDataFrame. Weird. And it has all of these pieces - called slots - as it is what we call an S4 object. More on that another time; suffice to say, instead of the $ notation, we use @ notation to access its pieces. Such as:

head(ecoregions@data)
##   ECO_CODE                   ECOREGION PROV_CODE
## 0    20192                Agulhas Bank        51
## 1    20053            Aleutian Islands        10
## 2    20072                    Amazonia        13
## 3    20194           Amsterdam-St Paul        52
## 4    20228 Amundsen/Bellingshausen Sea        61
## 5    20109 Andaman and Nicobar Islands        24
##                           PROVINCE RLM_CODE                      REALM
## 0                          Agulhas       10  Temperate Southern Africa
## 1 Cold Temperate Northeast Pacific        3 Temperate Northern Pacific
## 2               North Brazil Shelf        4          Tropical Atlantic
## 3                Amsterdam-St Paul       10  Temperate Southern Africa
## 4       Continental High Antarctic       12             Southern Ocean
## 5                          Andaman        5       Western Indo-Pacific
##   ALT_CODE ECO_CODE_X  Lat_Zone
## 0      189        192 Temperate
## 1       50         53 Temperate
## 2       64         72  Tropical
## 3      191        194 Temperate
## 4      209        228     Polar
## 5      105        109  Tropical

Here we see the ecoregion names and some other properties of them. Other slots are filled with information that defines polygons, etc.

Using joins to make ecoregion data frame for plotting

So, to make this into a data frame for mapping, we have to work a little bit of magic.

First, we need to turn those polygons into a data frame of points. There’s a function for this in broom called tidy. We need to tell it which column we are using to separate regions.

library(broom)
ecoregions_points <- tidy(ecoregions, region = "ECOREGION")

head(ecoregions_points)
##       long      lat order  hole piece          group           id
## 1 20.46950 43.05329     1 FALSE     1 Adriatic Sea.1 Adriatic Sea
## 2 20.96008 40.52461     2 FALSE     1 Adriatic Sea.1 Adriatic Sea
## 3 20.90520 40.49428     3 FALSE     1 Adriatic Sea.1 Adriatic Sea
## 4 19.68136 40.52051     4 FALSE     1 Adriatic Sea.1 Adriatic Sea
## 5 19.55918 40.53546     5 FALSE     1 Adriatic Sea.1 Adriatic Sea
## 6 19.38904 40.57329     6 FALSE     1 Adriatic Sea.1 Adriatic Sea

And now the last step - a join! We need to join the original data defining names of ecoregions with the data defining their borders.

ecoregions_df <- inner_join(ecoregions_points, ecoregions@data, 
                            by = c("id" = "ECOREGION")) %>%
  rename(ECOREGION = id)
## Warning: Column `id`/`ECOREGION` joining character vector and factor,
## coercing into character vector

Voila! We have ecoregions in a data frame format! Now let’s plot it. Note, I’m using group here, as group fixes problems with the map wrapping around both sides of the screen. Try ECOREGION as the group if you don’t believe me.

ggplot(data = ecoregions_df, 
       mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon()

Enter Regional Temperature Information

OK, we have some polygons - now what are we going to do with them? One use is to visualize climate change over time. I’ve been working on a project to look at climate change in ecoregions that contain kelp. So, for each ecoregion, I’ve gotten all of the temperature data from 1950 to the present from http://www.metoffice.gov.uk/hadobs/hadisst/data/download.html - a great source for sea surface temperature (SST) data.

sst_data <- read.csv("./data/hadsst_regions.csv",
                     stringsAsFactors=FALSE)

head(sst_data)
##                                    ECOREGION   DateName     tempC Year
## 1                               Agulhas Bank 1950.01.16 21.137959 1950
## 2                           Aleutian Islands 1950.01.16  3.909404 1950
## 3                                    Bassian 1950.01.16 14.309411 1950
## 4 Beaufort Sea - continental coast and shelf 1950.01.16 -1.800000 1950
## 5                                  Cape Howe 1950.01.16 18.455363 1950
## 6                                Celtic Seas 1950.01.16 10.399248 1950

How can we use this data to see climate change on a map? There are a lot of methods. One simple one is to visualize change by decade. First, we have to use a quick trick to calculate decades - divide by ten, round to a whole number, and then multiply by ten.

sst_decadal <- sst_data %>%
  mutate(Decade = round(Year/10)*10) %>%
  group_by(ECOREGION, Decade) %>%
  summarise(tempC = mean(tempC, na.rm=T)) %>%
  ungroup()
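The decade trick above is worth sanity-checking on a few years before trusting it:

```r
# divide by ten, round, multiply by ten
years <- c(1950, 1954, 1955, 1996, 2003)
round(years / 10) * 10
# 1950 1950 1960 2000 2000 - note that round() sends 1955 up and 1996
# into 2000, so each "decade" is centered on the round year rather
# than starting at it; use floor() instead for calendar decades
```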

Now, we can look at temperature by decade, but we can’t just look at raw temperature. The color palette would be too spread out, and we couldn’t compare, say, change in the tropics to change in the Arctic. One way that scientists have gotten around this is to look at temperature anomaly - that is, deviation from a long-term mean. So let’s calculate the decadal deviation from the long-term average for each ecoregion.

sst_decadal <- sst_decadal %>%
  group_by(ECOREGION) %>%
  mutate(tempC_anomoly = tempC - mean(tempC)) %>%
  ungroup()

OK, now we’ve got something worth plotting!

Visualizing Climate Change

There are a lot of ways we can visualize this. We can look at long-term trends, of course - plotting each region as its own line and overlaying the average long-term trend.

ggplot(data=sst_decadal, mapping=aes(x=Decade, y=tempC_anomoly)) +
  geom_line(mapping=aes(group=ECOREGION), colour="grey") +
  theme_bw() +
  stat_smooth(method="lm", color="red", fill=NA) +
  ylab("Decadal Temperature Anomaly")

But how do we put this information on a map so that we can truly see it?

As we did with states, we can join the temperature data to the marine ecoregion data.

sst_map_data <- inner_join(sst_decadal, ecoregions_df)
## Joining, by = "ECOREGION"

Once that is done, we can use the decadal anomaly as a fill, with facets to show different decades.

ggplot(data=sst_map_data, 
       mapping = aes(x = long, y = lat, 
                     group = group, fill = tempC_anomoly)) +
  borders("world", lwd=0.1, colour="black") +
  geom_polygon(alpha=0.9) +
  facet_wrap(~Decade) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_bw(base_size=14) +
  coord_equal()

It’s a different way of looking at the same data. But what do you see? What are things we can do to make it clearer?

Faded Examples

Let’s go back to our TB data. Here, we’re going to use a shapefile of international borders.

world_shapefile <- readOGR("./data/TM_WORLD_BORDERS_SIMPL-0.3")
## OGR data source with driver: ESRI Shapefile 
## Source: "./data/TM_WORLD_BORDERS_SIMPL-0.3", layer: "TM_WORLD_BORDERS_SIMPL-0.3"
## with 246 features
## It has 11 fields
## Integer64 fields read as strings:  POP2005

What’s inside?

plot(world_shapefile)

head(world_shapefile@data)
##   FIPS ISO2 ISO3 UN                NAME   AREA  POP2005 REGION SUBREGION
## 0   AC   AG  ATG 28 Antigua and Barbuda     44    83039     19        29
## 1   AG   DZ  DZA 12             Algeria 238174 32854159      2        15
## 2   AJ   AZ  AZE 31          Azerbaijan   8260  8352021    142       145
## 3   AL   AL  ALB  8             Albania   2740  3153731    150        39
## 4   AM   AM  ARM 51             Armenia   2820  3017661    142       145
## 5   AO   AO  AGO 24              Angola 124670 16095214      2        17
##       LON     LAT
## 0 -61.783  17.078
## 1   2.632  28.163
## 2  47.395  40.430
## 3  20.068  41.143
## 4  44.563  40.534
## 5  17.544 -12.296

Oh neat! Notice the ISO2 and ISO3 columns? Those are designed to match the standardized iso2 and iso3 columns in the tb_world data! Once we turn this into a data frame, it should be a snap to merge. Let’s make the same plot as we did previously, but now with the shapefile.

So, TB in 2015!

#get a map
worldmap_shp <- tidy(world_shapefile, region="NAME")

worldmap_shp_df <- left_join(worldmap_shp, world_shapefile@data,
                          by = c("id" = "NAME")) %>%
  rename(iso2 = ISO2, iso3 = ISO3)
## Warning: Column `id`/`NAME` joining character vector and factor, coercing
## into character vector
#filter to 2015
tb_2015 <- tb_world %>% filter(year == 2015)

#join
incidence_df <- left_join(worldmap_shp_df, tb_2015)
## Joining, by = c("iso2", "iso3")
## Warning: Column `iso2` joining factor and character vector, coercing into
## character vector
## Warning: Column `iso3` joining factor and character vector, coercing into
## character vector
#plot
ggplot(incidence_df, aes(x = long, y = lat, 
                         group = group, fill=est_incidences)) +
  geom_polygon() +
  scale_fill_gradient(low = "blue", high = "red")

Let’s look at mortality in 2000

#get a map
worldmap_shp <- tidy(world_shapefile, region="NAME")

worldmap_shp_df <- ____(worldmap_shp, world_shapefile@data,
                          by = c("id" = "NAME")) %>%
  rename(iso2 = ISO2, iso3 = ISO3)

#filter to 2000
tb_2000 <- tb_world %>% filter(year == ____)

#join
incidence_df_2000 <- left_join(____, ____)

#plot
ggplot(incidence_df_2000, aes(x = ____, y = ____, 
                         group = group, fill=est_mortality)) +
  geom_polygon() +
  scale_fill_gradient(____)

Now let’s look at incidence per 100K, binned by interval, in 2016 - feel free to do something different with the fill.

#get a map
worldmap_shp <- ____(____, ____="NAME")

worldmap_shp_df <- ____(____, world_shapefile@____,
                          by = c("id" = "NAME")) %>%
  rename(iso2 = ISO2, iso3 = ISO3)

#filter to 2016
tb_2016 <- tb_world %>% ____(year == ____)

#join
incidence_df_2016 <- left_join(____, ____)

#plot
ggplot(incidence_df_2016, aes(x = ____, y = ____, 
                         group = ____, 
                         fill=____(est_incidences_per_100K_people,5))) +
  ____() +
  scale_fill_brewer(____ = "YlOrBr",
                    ____ = guide_legend("Incidences per 100K"))

Exercises

Use the shapefile maps for this.

  1. Make maps that compare total incidence to incidence per 100K averaged across all years.

  2. Do you see different trends in incidence over time?